Credit Risk Modeling in Python

XGBoost
Logistic Regression
DataCamp
Author

Stephen Barrie

Published

January 23, 2023

If you’ve ever applied for a credit card or loan, you know that financial firms process your information before making a decision. This is because giving you a loan can have a serious financial impact on their business. But how do they make a decision? In this blog, we will learn how to prepare credit application data. After that, we will apply machine learning and business rules to reduce risk and ensure profitability. We will use two data sets that emulate real credit applications while focusing on business value.

1. Exploring and Preparing Loan Data

In this first section, we will discuss the concept of credit risk and define how it is calculated. Using cross tables and plots, we will explore a real-world data set. Before applying machine learning, we will process this data by finding and resolving problems.

1.1 Explore the credit data

We'll begin by loading in the dataset cr_loan.

# import required packages
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors

# load in our dataset
cr_loan = pd.read_csv('Data/cr_loan2.csv')
cr_loan
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_status loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
0 22 59000 RENT 123.0 PERSONAL D 35000 16.02 1 0.59 Y 3
1 21 9600 OWN 5.0 EDUCATION B 1000 11.14 0 0.10 N 2
2 25 9600 MORTGAGE 1.0 MEDICAL C 5500 12.87 1 0.57 N 3
3 23 65500 RENT 4.0 MEDICAL C 35000 15.23 1 0.53 N 2
4 24 54400 RENT 8.0 MEDICAL C 35000 14.27 1 0.55 Y 4
... ... ... ... ... ... ... ... ... ... ... ... ...
32576 57 53000 MORTGAGE 1.0 PERSONAL C 5800 13.16 0 0.11 N 30
32577 54 120000 MORTGAGE 4.0 PERSONAL A 17625 7.49 0 0.15 N 19
32578 65 76000 RENT 3.0 HOMEIMPROVEMENT B 35000 10.99 1 0.46 N 28
32579 56 150000 MORTGAGE 5.0 PERSONAL B 15000 11.48 0 0.10 N 26
32580 66 42000 RENT 2.0 MEDICAL B 6475 9.99 0 0.15 N 30

32581 rows × 12 columns

In this data set, loan_status shows whether the loan is currently in default with 1 being default and 0 being non-default.

We have eleven other columns within the data, and many could have a relationship with the values in loan_status. We will explore the data and these relationships more with further analysis to understand the impact of the data on credit loan defaults.
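
Before digging into relationships, it is worth checking how the two classes are balanced. A minimal check using pandas (nothing beyond the data already loaded above):

# Count the defaults and non-defaults
print(cr_loan['loan_status'].value_counts())

# And as a proportion of all loans
print(cr_loan['loan_status'].value_counts(normalize=True))

The mean of loan_status in the describe() output further down (about 0.22) tells the same story: roughly one in five loans in this data set is a default.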

Checking the structure of the data as well as seeing a snapshot helps us better understand what’s inside the set. Let’s check what type of data we are dealing with here and then look at the first five rows:

# Check the structure of the data
cr_loan.dtypes
person_age                      int64
person_income                   int64
person_home_ownership          object
person_emp_length             float64
loan_intent                    object
loan_grade                     object
loan_amnt                       int64
loan_int_rate                 float64
loan_status                     int64
loan_percent_income           float64
cb_person_default_on_file      object
cb_person_cred_hist_length      int64
dtype: object
# Check the first five rows of the data
cr_loan.head()
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_status loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
0 22 59000 RENT 123.0 PERSONAL D 35000 16.02 1 0.59 Y 3
1 21 9600 OWN 5.0 EDUCATION B 1000 11.14 0 0.10 N 2
2 25 9600 MORTGAGE 1.0 MEDICAL C 5500 12.87 1 0.57 N 3
3 23 65500 RENT 4.0 MEDICAL C 35000 15.23 1 0.53 N 2
4 24 54400 RENT 8.0 MEDICAL C 35000 14.27 1 0.55 Y 4
# get an overview of the numeric columns
cr_loan.describe()
person_age person_income person_emp_length loan_amnt loan_int_rate loan_status loan_percent_income cb_person_cred_hist_length
count 32581.000000 3.258100e+04 31686.000000 32581.000000 29465.000000 32581.000000 32581.000000 32581.000000
mean 27.734600 6.607485e+04 4.789686 9589.371106 11.011695 0.218164 0.170203 5.804211
std 6.348078 6.198312e+04 4.142630 6322.086646 3.240459 0.413006 0.106782 4.055001
min 20.000000 4.000000e+03 0.000000 500.000000 5.420000 0.000000 0.000000 2.000000
25% 23.000000 3.850000e+04 2.000000 5000.000000 7.900000 0.000000 0.090000 3.000000
50% 26.000000 5.500000e+04 4.000000 8000.000000 10.990000 0.000000 0.150000 4.000000
75% 30.000000 7.920000e+04 7.000000 12200.000000 13.470000 0.000000 0.230000 8.000000
max 144.000000 6.000000e+06 123.000000 35000.000000 23.220000 1.000000 0.830000 30.000000
# get an overview of the numeric columns
cr_loan.describe(include='object')
person_home_ownership loan_intent loan_grade cb_person_default_on_file
count 32581 32581 32581 32581
unique 4 6 7 2
top RENT EDUCATION A N
freq 16446 6453 10777 26836

Similarly, visualizations provide a high level view of the data in addition to important trends and patterns. Let’s plot a histogram of loan_amt which will provide us with a visual of the distribution of loan amounts.

# Look at the distribution of loan amounts with a histogram
n, bins, patches = plt.hist(x=cr_loan['loan_amnt'], bins='auto', color='blue',alpha=0.7, rwidth=0.85)
plt.xlabel("Loan Amount")
plt.show()

Let’s investigate the relationship between income and age by creating a scatter plot. Here, income is plotted on the x-axis as the independent variable and age on the y-axis as the dependent variable.

print("There are over 32 000 rows of data so the scatter plot may take a little while to plot.")

# Plot a scatter plot of income against age
plt.scatter(cr_loan['person_income'], cr_loan['person_age'],c='blue', alpha=0.5)
plt.xlabel('Personal Income')
plt.ylabel('Person Age')
plt.show()
There are over 32,000 rows of data so the scatter plot may take a little while to plot.

Starting with data exploration helps us keep from getting a.head() of ourselves! We can already see a positive correlation between age and income, which could mean the older recipients are further along in their careers and therefore earn higher salaries. There also appears to be an outlier in the data.
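
To put a number on that visual impression, a quick correlation check (a small sketch using pandas only; no new data is involved):

# Pearson correlation between age and income
print(cr_loan[['person_age', 'person_income']].corr())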

1.2 Crosstab and pivot tables

Often, financial data is viewed as a pivot table in spreadsheets like Excel. With cross tables, we can get a high level view of selected columns and even aggregation like a count or average. For most credit risk models, especially for probability of default, columns like person_emp_length and person_home_ownership are common to begin investigating.

We will be able to see how the values are populated throughout the data, and visualize them. For now, we need to check how loan_status is affected by factors like home ownership status, loan grade, and loan percentage of income.

Let’s dive in, and create a cross table of loan_intent and loan_status :

# Create a cross table of the loan intent and loan status
pd.crosstab(cr_loan['loan_intent'], cr_loan['loan_status'], margins = True)
loan_status 0 1 All
loan_intent
DEBTCONSOLIDATION 3722 1490 5212
EDUCATION 5342 1111 6453
HOMEIMPROVEMENT 2664 941 3605
MEDICAL 4450 1621 6071
PERSONAL 4423 1098 5521
VENTURE 4872 847 5719
All 25473 7108 32581

So the largest number of loan defaults (1,621) occurs when the reason for the loan was to cover medical expenses. That is perhaps not surprising - in some cases the medical condition might mean that the loan customer is unable to work and therefore might struggle to keep up with the loan repayments.

Let’s now look at home ownership grouped by loan_status and loan_grade :

# Create a cross table of home ownership, loan status, and grade
pd.crosstab(cr_loan['person_home_ownership'],[cr_loan['loan_status'],cr_loan['loan_grade']])
loan_status 0 1
loan_grade A B C D E F G A B C D E F G
person_home_ownership
MORTGAGE 5219 3729 1934 658 178 36 0 239 324 321 553 161 61 31
OTHER 23 29 11 9 2 0 0 3 5 6 11 6 2 0
OWN 860 770 464 264 26 7 0 66 34 31 18 31 8 5
RENT 3602 4222 2710 554 137 28 1 765 1338 981 1559 423 99 27

So the largest number of loan defaults (1,559) occurs where the customer is a renter and has taken out a grade D loan. We don’t know what these gradings mean and should find out more to aid our understanding.

Let’s now look at home ownership grouped by loan_status and average loan_percent_income :

# Create a cross table of home ownership, loan status, and average percent income
pd.crosstab(cr_loan['person_home_ownership'], cr_loan['loan_status'],
              values=cr_loan['loan_percent_income'], aggfunc='mean')
loan_status 0 1
person_home_ownership
MORTGAGE 0.146504 0.184882
OTHER 0.143784 0.300000
OWN 0.180013 0.297358
RENT 0.144611 0.264859

Let’s now create a boxplot of the loan’s percent of the person’s income grouped by loan_status :

# Create a box plot of percentage income by loan status
cr_loan.boxplot(column = ['loan_percent_income'], by = 'loan_status')
plt.title('Average Percent Income by Loan Status')
plt.suptitle('')
plt.show()

It looks like the average percentage of income for defaults is higher. This could indicate those recipients have a debt-to-income ratio that’s already too high.
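
A quick numeric confirmation of what the box plot suggests (a small sketch):

# Average loan percent of income, split by loan status
print(cr_loan.groupby('loan_status')['loan_percent_income'].mean())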

1.3 Finding outliers with cross tables

Now we need to find and remove outliers we suspect might be in the data. For this exercise, we can use cross tables and aggregate functions.

Have a look at the person_emp_length column. We used the aggfunc = mean argument to see the average of a numeric column before, but to detect outliers we can use other functions like min and max.

It may not be possible for a person to have an employment length of less than 0 or greater than 60. We can use cross tables to check the data and see if there are any instances of this!

Let’s print the cross table of loan_status and person_home_ownership with the max person_emp_length :

# Create the cross table for loan status, home ownership, and the max employment length
pd.crosstab(cr_loan['loan_status'],cr_loan['person_home_ownership'],
        values=cr_loan['person_emp_length'], aggfunc='max')
person_home_ownership MORTGAGE OTHER OWN RENT
loan_status
0 123.0 24.0 31.0 41.0
1 34.0 11.0 17.0 123.0

Let’s now create an array of indices for records with an employment length greater than 60. Store it as indices.

# Create an array of indices where employment length is greater than 60
indices = cr_loan[cr_loan['person_emp_length'] > 60].index

And then drop those records from the data :

# Drop the records from the data based on the indices and create a new dataframe
cr_loan = cr_loan.drop(indices)
cr_loan
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_status loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
1 21 9600 OWN 5.0 EDUCATION B 1000 11.14 0 0.10 N 2
2 25 9600 MORTGAGE 1.0 MEDICAL C 5500 12.87 1 0.57 N 3
3 23 65500 RENT 4.0 MEDICAL C 35000 15.23 1 0.53 N 2
4 24 54400 RENT 8.0 MEDICAL C 35000 14.27 1 0.55 Y 4
5 21 9900 OWN 2.0 VENTURE A 2500 7.14 1 0.25 N 2
... ... ... ... ... ... ... ... ... ... ... ... ...
32576 57 53000 MORTGAGE 1.0 PERSONAL C 5800 13.16 0 0.11 N 30
32577 54 120000 MORTGAGE 4.0 PERSONAL A 17625 7.49 0 0.15 N 19
32578 65 76000 RENT 3.0 HOMEIMPROVEMENT B 35000 10.99 1 0.46 N 28
32579 56 150000 MORTGAGE 5.0 PERSONAL B 15000 11.48 0 0.10 N 26
32580 66 42000 RENT 2.0 MEDICAL B 6475 9.99 0 0.15 N 30

32579 rows × 12 columns

We now have 32,579 rows - two have been removed after dropping records with an employment length greater than 60.

# Create the cross table from earlier and include minimum employment length
pd.crosstab(cr_loan['loan_status'],cr_loan['person_home_ownership'],
            values=cr_loan['person_emp_length'], aggfunc=['min','max'])
min max
person_home_ownership MORTGAGE OTHER OWN RENT MORTGAGE OTHER OWN RENT
loan_status
0 0.0 0.0 0.0 0.0 38.0 24.0 31.0 41.0
1 0.0 0.0 0.0 0.0 34.0 11.0 17.0 27.0

Generally with credit data, key columns like person_emp_length are of high quality, but there is always room for error. With this in mind, we build our intuition for detecting outliers!

1.4 Visualizing credit outliers

We discovered outliers in person_emp_length where values greater than 60 were far above the norm. person_age is another column in which a person can use a common sense approach to say it is very unlikely that a person applying for a loan will be over 100 years old.

Visualizing the data here can be another easy way to detect outliers. We can use other numeric columns like loan_amnt and loan_int_rate to create plots with person_age to search for outliers.

Let’s create a scatter plot of person_age on the x-axis and loan_amnt on the y-axis :

# Create the scatter plot for age and amount
plt.scatter(cr_loan['person_age'], cr_loan['loan_amnt'], c='blue', alpha=0.5)
plt.xlabel("Person Age")
plt.ylabel("Loan Amount")
plt.show()

Let’s use the .drop() method from Pandas to remove the outliers :

# Use Pandas to drop the record from the data frame and create a new one
cr_loan = cr_loan.drop(cr_loan[cr_loan['person_age'] > 100].index)
cr_loan
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_status loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
1 21 9600 OWN 5.0 EDUCATION B 1000 11.14 0 0.10 N 2
2 25 9600 MORTGAGE 1.0 MEDICAL C 5500 12.87 1 0.57 N 3
3 23 65500 RENT 4.0 MEDICAL C 35000 15.23 1 0.53 N 2
4 24 54400 RENT 8.0 MEDICAL C 35000 14.27 1 0.55 Y 4
5 21 9900 OWN 2.0 VENTURE A 2500 7.14 1 0.25 N 2
... ... ... ... ... ... ... ... ... ... ... ... ...
32576 57 53000 MORTGAGE 1.0 PERSONAL C 5800 13.16 0 0.11 N 30
32577 54 120000 MORTGAGE 4.0 PERSONAL A 17625 7.49 0 0.15 N 19
32578 65 76000 RENT 3.0 HOMEIMPROVEMENT B 35000 10.99 1 0.46 N 28
32579 56 150000 MORTGAGE 5.0 PERSONAL B 15000 11.48 0 0.10 N 26
32580 66 42000 RENT 2.0 MEDICAL B 6475 9.99 0 0.15 N 30

32574 rows × 12 columns

We now have 32,574 rows - a further five have been removed following our removal of records with an age greater than 100.

Then, let’s create a scatter plot of person_age on the x-axis and loan_int_rate on the y-axis with a label for loan_status :

# Create a scatter plot of age and interest rate
colors = ["blue","red"] # default is red
plt.scatter(cr_loan['person_age'], cr_loan['loan_int_rate'],
            c = cr_loan['loan_status'],
            cmap = matplotlib.colors.ListedColormap(colors),
            alpha=0.5)
plt.xlabel("Person Age")
plt.ylabel("Loan Interest Rate")
plt.show()

Notice that in the last plot we use loan_status to color the points: red for default, blue for non-default. It looks like there are more defaults at higher interest rates.

1.5 Replacing missing credit data

missing_data.JPG

missing_data_perf.JPG

Now, we should check for missing data. If we find missing data within loan_status, we would not be able to use the data for predicting probability of default because we wouldn’t know if the loan was a default or not. Missing data within person_emp_length would not be as damaging, but would still cause training errors.

So, let’s check for missing data in the person_emp_length column and replace any missing values with the median.

# Print a null value column array
cr_loan.columns[cr_loan.isnull().any()]
Index(['person_emp_length', 'loan_int_rate'], dtype='object')
# Print the top five rows with nulls for employment length
cr_loan[cr_loan['person_emp_length'].isnull()].head()
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_status loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
105 22 12600 MORTGAGE NaN PERSONAL A 2000 5.42 1 0.16 N 4
222 24 185000 MORTGAGE NaN EDUCATION B 35000 12.42 0 0.19 N 2
379 24 16800 MORTGAGE NaN DEBTCONSOLIDATION A 3900 NaN 1 0.23 N 3
407 25 52000 RENT NaN PERSONAL B 24000 10.74 1 0.46 N 2
408 22 17352 MORTGAGE NaN EDUCATION C 2250 15.27 0 0.13 Y 3
# Impute the null values with the median value for all employment lengths
cr_loan['person_emp_length'].fillna((cr_loan['person_emp_length'].median()), inplace=True)

# Create a histogram of employment length
n, bins, patches = plt.hist(cr_loan['person_emp_length'], bins='auto', color='blue')
plt.xlabel("Person Employment Length")
plt.show()

We can use several different functions like mean() and median() to replace missing data. The goal here is to keep as much of our data as we can! It’s also important to check the distribution of that feature to see if it changed.
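
As a sketch of that check, written as if the fill above had not yet been applied, we can compare the summary statistics of the observed values against the column after imputation:

# Compare the observed distribution with the distribution after imputation
emp_observed = cr_loan['person_emp_length'].dropna()
emp_imputed = cr_loan['person_emp_length'].fillna(emp_observed.median())

print(emp_observed.describe())
print(emp_imputed.describe())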

1.6 Removing missing data

We replaced missing data in person_emp_length, but in the previous section we saw that loan_int_rate has missing data as well.

Similar to having missing data within loan_status, having missing data within loan_int_rate will make predictions difficult.

Because interest rates are set by our company, having missing data in this column is very strange. It’s possible that data ingestion issues created errors, but we cannot know for sure. For now, it’s best to .drop() these records before moving forward.

# Print the number of nulls
print(cr_loan['loan_int_rate'].isnull().sum())

# Store the array on indices
indices = cr_loan[cr_loan['loan_int_rate'].isnull()].index
                  
# Save the new data without missing data
cr_loan_clean = cr_loan.drop(indices)
3115
cr_loan_clean
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_status loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
1 21 9600 OWN 5.0 EDUCATION B 1000 11.14 0 0.10 N 2
2 25 9600 MORTGAGE 1.0 MEDICAL C 5500 12.87 1 0.57 N 3
3 23 65500 RENT 4.0 MEDICAL C 35000 15.23 1 0.53 N 2
4 24 54400 RENT 8.0 MEDICAL C 35000 14.27 1 0.55 Y 4
5 21 9900 OWN 2.0 VENTURE A 2500 7.14 1 0.25 N 2
... ... ... ... ... ... ... ... ... ... ... ... ...
32576 57 53000 MORTGAGE 1.0 PERSONAL C 5800 13.16 0 0.11 N 30
32577 54 120000 MORTGAGE 4.0 PERSONAL A 17625 7.49 0 0.15 N 19
32578 65 76000 RENT 3.0 HOMEIMPROVEMENT B 35000 10.99 1 0.46 N 28
32579 56 150000 MORTGAGE 5.0 PERSONAL B 15000 11.48 0 0.10 N 26
32580 66 42000 RENT 2.0 MEDICAL B 6475 9.99 0 0.15 N 30

29459 rows × 12 columns

Our clean dataset now has 29,459 rows - 3,115 have been removed because they contained null interest rates.

Now that the missing data and outliers have been processed, the data is ready for modeling! More often than not, financial data is fairly tidy, but it’s always good to practice preparing data for analytical work.

2. Logistic Regression for Defaults

With the loan data fully prepared, we will discuss the logistic regression model which is a standard in risk modeling. We will understand the components of this model as well as how to score its performance. Once we’ve created predictions, we can explore the financial impact of utilizing this model.

deafult_prob.JPG

probabilities.JPG

2.1 Logistic regression basics

logistic_regression.JPG

scikit_learn.JPG

We have now cleaned up the data and created the new data set cr_loan_clean.

Think back to the final scatter plot from section 1 which showed more defaults with high loan_int_rate. Interest rates are easy to understand, but how useful are they for predicting the probability of default?

Since we haven’t tried predicting the probability of default yet, let’s test out creating and training a logistic regression model with just loan_int_rate. Also check the model’s internal parameters, which are like settings, to see the structure of the model with this one column.

## import required packages
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score

import numpy as np
# Create the X and y data sets
X = cr_loan_clean[['loan_int_rate']]
y = cr_loan_clean[['loan_status']]

# Create and fit a logistic regression model
clf_logistic_single = LogisticRegression(solver='lbfgs') # the solver is the optimization algorithm used to fit the coefficients
clf_logistic_single.fit(X, np.ravel(y))

# Print the parameters of the model
print(clf_logistic_single.get_params())

# Print the intercept of the model
print(clf_logistic_single.intercept_)
{'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
[-4.45785901]

Note the solver argument of the LogisticRegression model. This is the optimization algorithm used to fit the coefficients, similar in spirit to Excel's Solver or to stochastic gradient descent (SGD). The solver used here is lbfgs, which stands for Limited-memory Broyden–Fletcher–Goldfarb–Shanno.

Note that we use NumPy’s np.ravel to flatten our labels into a one-dimensional array instead of a DataFrame, as this is the format the model expects.

Notice that the model was able to fit the data and establish some parameters internally. It even produced an .intercept_ value of [-4.45785901]. Since loan_status = 1 marks a default, this is the model's log-odds of default when the feature value is zero. What if we use more than one column?
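
Because the intercept is expressed in log-odds, it can be converted into a probability with the logistic (sigmoid) function. A small sketch of the mechanics (note that a zero interest rate is well outside the observed data, so this is purely illustrative):

# Convert the intercept (log-odds of default at loan_int_rate = 0) into a probability
intercept = clf_logistic_single.intercept_[0]
prob_default_at_zero = 1 / (1 + np.exp(-intercept))
print(prob_default_at_zero)

This works out to a probability of default of roughly 1% at a zero interest rate.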

2.2 Multivariate logistic regression

Generally, we won’t use only loan_int_rate to predict the probability of default. We will want to use all the data we have to make predictions.

With this in mind, let’s try training a new model with different columns, called features, from the cr_loan_clean data. Will this model differ from the first one? For this, we can easily check the .intercept_ of the logistic regression. Remember that this is the y-intercept of the log-odds function - the model's baseline log-odds of default when all feature values are zero.

Let’s add employment length to our features:

# Create X data for the model
X_multi = cr_loan_clean[['loan_int_rate','person_emp_length']]

# Create a set of y data for training
y = cr_loan_clean[['loan_status']]

# Create and train a new logistic regression
clf_logistic_multi = LogisticRegression(solver='lbfgs').fit(X_multi, np.ravel(y))

# Print the intercept of the model
print(clf_logistic_multi.intercept_)
[-4.21645549]

Take a closer look at each model’s .intercept_ value. The values have changed! The new clf_logistic_multi model has an .intercept_ value closer to zero, which means the baseline log-odds of default have moved closer to zero.

2.3 Creating training and test sets

train_test_split.JPGdownload.jpg

We’ve just trained LogisticRegression() models on different columns. We know that the data should be separated into training and test sets; train_test_split() creates both at the same time. The training set is used to fit the model, while the test set is used for evaluation. Without evaluating the model, we have no way to tell how well it will perform on new loan data.

In addition to the intercept_, which is an attribute of the model, LogisticRegression() models also have the .coef_ attribute. This shows how important each training column is for predicting the probability of default.

Let’s create a data set X using interest rate, employment length, and income. Create the y set as always using our target variable (label) loan status :

# Create the X and y data sets
X = cr_loan_clean[['loan_int_rate','person_emp_length','person_income']]
y = cr_loan_clean[['loan_status']]

# Use train_test_split to create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=123)

# Create and fit the logistic regression model to our training set
clf_logistic = LogisticRegression(solver='lbfgs').fit(X_train, np.ravel(y_train))

# Print the model coefficients
print(clf_logistic.coef_)

# Print the model intercept
print(clf_logistic.intercept_)
[[ 1.28517496e-09 -2.27622202e-09 -2.17211991e-05]]
[-3.30582292e-10]

We can see that three columns were used for training, so there are three values in .coef_. This tells us how important each column, or feature, was for predicting defaults. The more positive the value, the more it points towards default - look at the value for loan_int_rate (1.28517496e-09). This makes sense: the higher the interest rate, the higher the loan repayments and therefore the greater the risk of default. On the other hand, both employment length and income have negative coefficients. This also makes sense: a long time in employment suggests stability, and the higher a person’s income, the less likely they are to default, as they are better placed to meet the loan repayments.

We can plug these four values (the three coefficients and the intercept) into our logistic regression formula to arrive at the overall prediction of loan default, as sketched in the code after the image below:

co_efficients.JPG
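
As a sketch of that calculation in code, we can reproduce the predicted probability of default for a single test record by hand and check it against predict_proba (purely illustrative, using the model and test set created above):

# Reproduce the model's predicted probability of default for one test row by hand
row = X_test.iloc[[0]]  # first record in the test set

# log-odds = intercept + coefficients dotted with the feature values
log_odds = clf_logistic.intercept_[0] + np.dot(row.values[0], clf_logistic.coef_[0])
prob_default_manual = 1 / (1 + np.exp(-log_odds))

# Compare with scikit-learn's own prediction
print(prob_default_manual, clf_logistic.predict_proba(row)[0, 1])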

2.4 Changing coefficients

With this understanding of the coefficients of a LogisticRegression() model, let’s have a closer look at them to see how they change depending on what columns are used for training. Will the column coefficients change from model to model?

We should .fit() two different LogisticRegression() models on different groups of columns to check. We should also consider what the potential impact on the probability of default might be.

# create X1, X2 and y datasets
X1 = cr_loan_clean[['person_income','person_emp_length','loan_amnt']]
X2 = cr_loan_clean[['person_income','loan_percent_income','cb_person_cred_hist_length']]
y = cr_loan_clean[['loan_status']]
# train, test, split
X1_train, X1_test, y_train, y_test = train_test_split(X1, y, test_size=.4, random_state=123)
X2_train, X2_test, y_train, y_test = train_test_split(X2, y, test_size=.4, random_state=123)
# Print the first five rows of each training set
print(X1_train.head())
print(X2_train.head())

# Create and train a model on the first training data
clf_logistic1 = LogisticRegression(solver='lbfgs').fit(X1_train, np.ravel(y_train))

# Create and train a model on the second training data
clf_logistic2 = LogisticRegression(solver='lbfgs').fit(X2_train, np.ravel(y_train))

# Print the coefficients of each model
print(clf_logistic1.coef_)
print(clf_logistic2.coef_)
       person_income  person_emp_length  loan_amnt
24407          72000                3.0       9000
2528           21000                2.0       3000
15961           4800                0.0       1200
6966           92000                0.0       6000
13832          96000                1.0       7000
       person_income  loan_percent_income  cb_person_cred_hist_length
24407          72000                 0.13                           6
2528           21000                 0.14                           2
15961           4800                 0.25                           2
6966           92000                 0.07                           3
13832          96000                 0.07                           4
[[-4.02643517e-05 -3.06659219e-08  1.06277246e-04]]
[[-2.17213449e-05  5.29012401e-10 -2.80735543e-09]]

Notice that the coefficient for person_income changed when we changed the training columns: it is -4.02643517e-05 with X1 but -2.17213449e-05 with X2. This is a reason to keep most of the data, like we did in section 1, because the models will learn differently depending on what data they’re given!

2.5 One-hot encoding credit data

Scikit-learn models cannot work with non-numeric columns directly, so we have to use one-hot encoding to convert each categorical value into its own column of 0s and 1s.

one_hot_encoding.JPG

We can use get_dummies() from the pandas library to do this.

get_dummies.JPG

It’s time to prepare the non-numeric columns so they can be added to our LogisticRegression() model. Once the new columns have been created using one-hot encoding, we can concatenate them with the numeric columns to create a new data frame which will be used throughout the rest of the course for predicting probability of default.

Remember to only one-hot encode the non-numeric columns. Doing this to the numeric columns would create an incredibly wide data set!

# Create two data sets for numeric and non-numeric data
cred_num = cr_loan_clean.select_dtypes(exclude=['object'])
cred_str = cr_loan_clean.select_dtypes(include=['object'])

# One-hot encode the non-numeric columns
cred_str_onehot = pd.get_dummies(cred_str)

# Union the one-hot encoded columns to the numeric ones
cr_loan_prep = pd.concat([cred_num, cred_str_onehot], axis=1)

# Print the columns in the new data set
print(cr_loan_prep.columns)
Index(['person_age', 'person_income', 'person_emp_length', 'loan_amnt',
       'loan_int_rate', 'loan_status', 'loan_percent_income',
       'cb_person_cred_hist_length', 'person_home_ownership_MORTGAGE',
       'person_home_ownership_OTHER', 'person_home_ownership_OWN',
       'person_home_ownership_RENT', 'loan_intent_DEBTCONSOLIDATION',
       'loan_intent_EDUCATION', 'loan_intent_HOMEIMPROVEMENT',
       'loan_intent_MEDICAL', 'loan_intent_PERSONAL', 'loan_intent_VENTURE',
       'loan_grade_A', 'loan_grade_B', 'loan_grade_C', 'loan_grade_D',
       'loan_grade_E', 'loan_grade_F', 'loan_grade_G',
       'cb_person_default_on_file_N', 'cb_person_default_on_file_Y'],
      dtype='object')

Look at all those columns! If you’ve ever seen a credit scorecard, the column_name_value format should look familiar. If you haven’t seen one, look up some pictures during your next break!

2.6 Predicting probability of default

predict.JPG

All of the data processing is complete and it’s time to begin creating predictions for probability of default. We want to train a LogisticRegression() model on the data, and examine how it predicts the probability of default.

So that we can better grasp what the model produces with predict_proba, let’s look at a few example records alongside their predicted probabilities of default. How do the first five predictions look against the actual values of loan_status?

cr_loan_prep
person_age person_income person_emp_length loan_amnt loan_int_rate loan_status loan_percent_income cb_person_cred_hist_length person_home_ownership_MORTGAGE person_home_ownership_OTHER ... loan_intent_VENTURE loan_grade_A loan_grade_B loan_grade_C loan_grade_D loan_grade_E loan_grade_F loan_grade_G cb_person_default_on_file_N cb_person_default_on_file_Y
1 21 9600 5.0 1000 11.14 0 0.10 2 0 0 ... 0 0 1 0 0 0 0 0 1 0
2 25 9600 1.0 5500 12.87 1 0.57 3 1 0 ... 0 0 0 1 0 0 0 0 1 0
3 23 65500 4.0 35000 15.23 1 0.53 2 0 0 ... 0 0 0 1 0 0 0 0 1 0
4 24 54400 8.0 35000 14.27 1 0.55 4 0 0 ... 0 0 0 1 0 0 0 0 0 1
5 21 9900 2.0 2500 7.14 1 0.25 2 0 0 ... 1 1 0 0 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32576 57 53000 1.0 5800 13.16 0 0.11 30 1 0 ... 0 0 0 1 0 0 0 0 1 0
32577 54 120000 4.0 17625 7.49 0 0.15 19 1 0 ... 0 1 0 0 0 0 0 0 1 0
32578 65 76000 3.0 35000 10.99 1 0.46 28 0 0 ... 0 0 1 0 0 0 0 0 1 0
32579 56 150000 5.0 15000 11.48 0 0.10 26 1 0 ... 0 0 1 0 0 0 0 0 1 0
32580 66 42000 2.0 6475 9.99 0 0.15 30 0 0 ... 0 0 1 0 0 0 0 0 1 0

29459 rows × 27 columns

# set features and target variable
X = cr_loan_prep.drop('loan_status', axis=1)
y = cr_loan_prep[['loan_status']]


# train, test, split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=123)
# Train the logistic regression model on the training data
clf_logistic = LogisticRegression(solver='lbfgs').fit(X_train, np.ravel(y_train))

# Create predictions of probability for loan status using test data
preds = clf_logistic.predict_proba(X_test)

# Create dataframes of first five predictions
preds_df = pd.DataFrame(preds[:,1][0:5], columns = ['prob_default'])

# Create dataframe of first five true labels
true_df = y_test.head()

# Concatenate and print the two data frames for comparison
print(pd.concat([true_df.reset_index(drop = True), preds_df], axis = 1))
   loan_status  prob_default
0            1      0.445779
1            1      0.223447
2            0      0.288558
3            0      0.169358
4            1      0.114182

We have some predictions now, but they don’t look very accurate do they? It looks like most of the rows with loan_status at 1 have a low probability of default. How good are the rest of the predictions? Next, let’s see if we can determine how accurate the entire model is.

2.7 Default classification reporting

accuracy.JPG

It’s time to take a closer look at the evaluation of our model. Here is where setting the threshold for probability of default will help us analyze the model’s performance through classification reporting.

Creating a data frame of the probabilities makes them easier to work with, because we can use all the power of pandas. Apply the threshold to the data and check the value counts for both classes of loan_status to see how many predictions of each are being created. This will help with insight into the scores from the classification report.

preds_df
prob_default
0 0.445779
1 0.223447
2 0.288558
3 0.169358
4 0.114182
# Create a dataframe for the probabilities of default
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_default'])

# Reassign loan status based on the threshold
preds_df['loan_status'] = preds_df['prob_default'].apply(lambda x: 1 if x > 0.5 else 0) # lambda allows us to use an anonymous function without defining it with def

# Print the row counts for each loan status
print(preds_df['loan_status'].value_counts())

# Print the classification report
target_names = ['Non-Default', 'Default'] # our labels
print(classification_report(y_test, preds_df['loan_status'], target_names=target_names))
0    11175
1      609
Name: loan_status, dtype: int64
              precision    recall  f1-score   support

 Non-Default       0.81      0.98      0.89      9198
     Default       0.71      0.17      0.27      2586

    accuracy                           0.80     11784
   macro avg       0.76      0.57      0.58     11784
weighted avg       0.79      0.80      0.75     11784

Well isn’t this a surprise! It looks like almost all of our test set was predicted to be non-default. The recall for defaults is 0.17 meaning 17% of our true defaults were predicted correctly.

2.8 Selecting report metrics

The classification_report() has many different metrics within it, but you may not always want to print out the full report. Sometimes you just want specific values to compare models or use for other purposes.

There is a function within scikit-learn that pulls out the values for you. That function is precision_recall_fscore_support() and it takes in the same parameters as classification_report.

It is imported and used like this:

# Import function
from sklearn.metrics import precision_recall_fscore_support
# Select all non-averaged values from the report
precision_recall_fscore_support(y_true,predicted_values)
# Print the classification report
target_names = ['Non-Default', 'Default']
print(classification_report(y_test, preds_df['loan_status'], target_names=target_names))
              precision    recall  f1-score   support

 Non-Default       0.81      0.98      0.89      9198
     Default       0.71      0.17      0.27      2586

    accuracy                           0.80     11784
   macro avg       0.76      0.57      0.58     11784
weighted avg       0.79      0.80      0.75     11784
# Print all the non-average values from the report
print(precision_recall_fscore_support(y_test,preds_df['loan_status']))
(array([0.80742729, 0.71264368]), array([0.98097412, 0.16782676]), array([0.8857802 , 0.27167449]), array([9198, 2586]))
# Print the first two numbers from the report
print(precision_recall_fscore_support(y_test,preds_df['loan_status'])[0])
[0.80742729 0.71264368]

Now we know how to pull out specific values from the report to either store later for comparison, or use to check against portfolio performance. Remember the impact of recall for defaults? This way, we can store that value for later calculations.

2.9 Visually scoring credit models

ROC.JPG

ROC_chart.JPG

Now, we want to visualize the performance of the model. In ROC charts, the X and Y axes are two metrics we’ve already looked at: the false positive rate (fall-out), and the true positive rate (sensitivity).

We can create a ROC chart of its performance with the following code:

fallout, sensitivity, thresholds = roc_curve(y_test, prob_default)
plt.plot(fallout, sensitivity)

To calculate the AUC score, you use roc_auc_score()

# Create predictions and store them in a variable
preds= clf_logistic.predict_proba(X_test)

# Print the accuracy score of the model
print(clf_logistic.score(X_test, y_test))

# Plot the ROC curve of the probabilities of default
prob_default = preds[:, 1]
fallout, sensitivity, thresholds = roc_curve(y_test, prob_default)

fig, ax = plt.subplots()
plt.plot(fallout, sensitivity, color = 'darkorange')
plt.plot([0, 1], [0, 1], linestyle='--')
ax.set_xlabel('False Positive Rate (Fallout)')
ax.set_ylabel('True Positive Rate (Sensitivity)')
plt.show()

# Compute the AUC and store it in a variable
auc = roc_auc_score(y_test, prob_default)
auc
0.8025288526816021

0.7643248801355148

So the accuracy of this model is just over 80% and the AUC score is about 76.4%. The ROC chart above shows the trade-off between the false positive rate (fallout) and the true positive rate (sensitivity) across all possible thresholds.

2.10 Thresholds and confusion matrices

thresholds.JPG

confusion_matrix.JPG

The recall score for loan defaults is the number of correctly predicted defaults divided by the total number of defaults. Note that if we were to predict ALL of our loans to default, then our recall score would be 100%!
The recall score for non-defaults is the number of correctly predicted non-defaults, divided by the total number of non-defaults.

The precision score for loan defaults is the number of correctly predicted defaults divided by the total number of predicted defaults.
The precision score for non-loan defaults is the number of correctly predicted non-defaults, divided by the total number of predicted non-defaults.
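
These definitions can be checked directly against a confusion matrix. A small sketch, assuming preds_df still holds the 0.5-threshold predictions used for the classification report above:

# Recall and precision computed by hand from the confusion matrix
# (rows are true classes 0 and 1, columns are predicted classes 0 and 1)
tn, fp, fn, tp = confusion_matrix(y_test, preds_df['loan_status']).ravel()

default_recall = tp / (tp + fn)          # correctly predicted defaults / all true defaults
nondefault_recall = tn / (tn + fp)       # correctly predicted non-defaults / all true non-defaults
default_precision = tp / (tp + fp)       # correctly predicted defaults / all predicted defaults
nondefault_precision = tn / (tn + fn)    # correctly predicted non-defaults / all predicted non-defaults

print(default_recall, nondefault_recall, default_precision, nondefault_precision)

These should reproduce the 0.17 and 0.98 recall values and the 0.71 and 0.81 precision values from the classification report above.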

We’ve looked at setting thresholds for defaults, but how does this impact overall performance? To do this, we can start by looking at the effects with confusion matrices. Set different values for the threshold on probability of default, and use a confusion matrix to see how the changing values affect the model’s performance.

# Set the threshold for defaults to 0.5
preds_df['loan_status'] = preds_df['prob_default'].apply(lambda x: 1 if x > 0.5 else 0)

# Print the confusion matrix
print(confusion_matrix(y_test,preds_df['loan_status']))
[[9023  175]
 [2152  434]]
print(recall_score(y_test,preds_df['loan_status']))
0.16782675947409126
# Set the threshold for defaults to 0.4
preds_df['loan_status'] = preds_df['prob_default'].apply(lambda x: 1 if x > 0.4 else 0)

# Print the confusion matrix
print(confusion_matrix(y_test, preds_df['loan_status']))
[[8476  722]
 [1386 1200]]
print(recall_score(y_test,preds_df['loan_status']))
0.46403712296983757

recall_precision_accuracy.JPG

2.11 How thresholds affect performance

Setting the threshold to 0.4 shows promising results for model evaluation. Now we can assess the financial impact using the default recall, which we can pull from the classification report with precision_recall_fscore_support().

For this, we will estimate the amount of unexpected loss by using the default recall to find what proportion of defaults we did not catch with the new threshold. This gives a dollar amount that tells us how much we would lose if all of the uncaught defaults defaulted at once.

# Reassign the values of loan status based on the new threshold
preds_df['loan_status'] = preds_df['prob_default'].apply(lambda x: 1 if x > 0.4 else 0)

# Store the number of loan defaults from the prediction data
num_defaults = preds_df['loan_status'].value_counts()[1]

# Store the default recall from the classification report
default_recall = precision_recall_fscore_support(y_test,preds_df['loan_status'])[1][1]

# Calculate the estimated impact of the new default recall rate
print(cr_loan_prep['loan_amnt'].mean() * num_defaults * (1 - default_recall))
9872265.223119883

By our estimates, this loss would be around $9.9 million. That seems like a lot! Try rerunning this code with threshold values of 0.3 and 0.5. Do you see the estimated losses changing? How do we find a good threshold value based on these metrics alone?

2.12 Threshold selection

thresholds_apply.JPG

We know there is a trade off between metrics like default recall, non-default recall, and model accuracy. One easy way to approximate a good starting threshold value is to look at a plot of all three using matplotlib. With this graph, you can see how each of these metrics look as you change the threshold values and find the point at which the performance of all three is good enough to use for the credit data.

# Instantiate values
thresh =  [0.2, 0.225, 0.25, 0.275, 0.3, 0.325, 0.35, 0.375, 0.4, 0.425, 0.45, 0.475, 0.5, 0.525, 0.55, 0.575, 0.6, 0.625, 0.65]
def_recalls = [0.7981438515081206, 0.7583139984532096, 0.7157772621809745, 0.6759474091260634, 0.6349574632637278, 0.594354215003867, 0.5467904098994586, 0.5054137664346481,
 0.46403712296983757, 0.39984532095901004, 0.32211910286156226, 0.2354988399071926, 0.16782675947409126, 0.1148491879350348, 0.07733952049497293, 0.05529775715390565,
 0.03750966744006187, 0.026295436968290797, 0.017788089713843776]
nondef_recalls = [0.5342465753424658, 0.5973037616873234, 0.6552511415525114, 0.708306153511633, 0.756468797564688, 0.8052837573385518, 0.8482278756251359, 0.8864970645792564,
 0.9215046749293324, 0.9492280930637095, 0.9646662317895195, 0.9733637747336378, 0.9809741248097412, 0.9857577734290063, 0.9902152641878669, 0.992280930637095, 0.9948901935203305,
 0.9966297021091541, 0.997499456403566]
accs = [0.5921588594704684, 0.6326374745417516, 0.6685336048879837, 0.7012050237610319, 0.7298031228784793, 0.7589952477936185, 0.7820773930753564, 0.8028682959945689,
 0.8211133740665308, 0.8286659877800407, 0.8236591989137814, 0.811439239646979, 0.8025288526816021, 0.7946367956551256, 0.7898845892735913, 0.7866598778004074,
 0.7847929395790902, 0.7836897488119484, 0.7825016972165648]
ticks = [0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65]

Note: the above values were pre-loaded within the DataCamp workspace for this lesson; I have recreated them here for reproducibility.

plt.plot(thresh,def_recalls)
plt.plot(thresh,nondef_recalls)
plt.plot(thresh,accs)
plt.xlabel("Probability Threshold")
plt.xticks(ticks)
plt.legend(["Default Recall","Non-default Recall","Model Accuracy"])
plt.show()

Have a closer look at this plot. In fact, expand the window to get a really good look. Think about the threshold values from thresh and how they affect each of these three metrics. Approximately what starting threshold value would maximize these scores evenly?

The easiest pattern to see on this graph is the point where all three lines converge, at a threshold of roughly 0.275. This would make a great starting point, but declaring all loans above 0.275 to be a default is probably not practical.
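
One way to locate that crossover programmatically is to find the threshold where default recall and non-default recall are closest (a small sketch using the lists defined above):

# Find the threshold where default recall and non-default recall are closest
diffs = [abs(d - n) for d, n in zip(def_recalls, nondef_recalls)]
print(thresh[int(np.argmin(diffs))])

This lands on the 0.275 that the plot suggests.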

3. Gradient Boosted Trees Using XGBoost

Decision trees are another standard credit risk model. We will go beyond decision trees by using the trendy XGBoost package in Python to create gradient boosted trees. After developing sophisticated models, we will stress test their performance and discuss column selection in unbalanced data.

3.1 Trees for defaults

We will now train a Gradient Boosted Tree model on the credit data, and see a sample of some of the predictions. Do you remember when we first looked at the predictions of the LogisticRegression() model? They didn’t look good. Do you think this model will be different?

# Train a model
!pip install xgboost
import xgboost as xgb
clf_gbt = xgb.XGBClassifier().fit(X_train, np.ravel(y_train))

# Predict with a model
gbt_preds = clf_gbt.predict_proba(X_test)

# Create dataframes of first five predictions, and first five true labels
preds_df = pd.DataFrame(gbt_preds[:,1][0:5], columns = ['prob_default'])
true_df = y_test.head()

# Concatenate and print the two data frames for comparison
print(pd.concat([true_df.reset_index(drop = True), preds_df], axis = 1))
Requirement already satisfied: xgboost in /home/stephen137/mambaforge/lib/python3.10/site-packages (1.7.3)
Requirement already satisfied: numpy in /home/stephen137/mambaforge/lib/python3.10/site-packages (from xgboost) (1.22.4)
Requirement already satisfied: scipy in /home/stephen137/mambaforge/lib/python3.10/site-packages (from xgboost) (1.9.1)
   loan_status  prob_default
0            1      0.990942
1            1      0.983987
2            0      0.000807
3            0      0.001239
4            1      0.084892

The predictions don’t look the same as with the LogisticRegression(), do they? Notice that this model is already accurately predicting the probability of default for some loans with a true value of 1 in loan_status.

3.2 Gradient boosted portfolio performance

At this point we’ve looked at predicting probability of default using both a LogisticRegression() and XGBClassifier(). We’ve looked at some scoring and have seen samples of the predictions, but what is the overall effect on portfolio performance? Let's use expected loss as a scenario to express the importance of testing different models.

A data frame called portfolio has been created to combine the probabilities of default from both models, the loss given default (assume 20% for now), and loan_amnt, which will be treated as the exposure at default. Expected loss is then the product of these three: EL = PD × LGD × EAD.
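
The portfolio data frame was pre-loaded in the DataCamp workspace. A sketch of how an equivalent frame could be rebuilt locally (the column names match the code below; the construction itself is my assumption):

# Rebuild an equivalent portfolio data frame from the two fitted models
portfolio = pd.DataFrame({
    'gbt_prob_default': clf_gbt.predict_proba(X_test)[:, 1],
    'lr_prob_default': clf_logistic.predict_proba(X_test)[:, 1],
    'lgd': 0.2,                               # assumed loss given default of 20%
    'loan_amnt': X_test['loan_amnt'].values   # loan amount treated as exposure at default
})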

# Print the first five rows of the portfolio data frame
print(portfolio.head())

portfolio_head.JPG
# Create expected loss columns for each model using the formula
portfolio['gbt_expected_loss'] = portfolio['gbt_prob_default'] * portfolio['lgd'] * portfolio['loan_amnt']
portfolio['lr_expected_loss'] = portfolio['lr_prob_default'] * portfolio['lgd'] * portfolio['loan_amnt']

# Print the sum of the expected loss for lr
print('LR expected loss: ', np.sum(portfolio['lr_expected_loss']))

# Print the sum of the expected loss for gbt
print('GBT expected loss: ', np.sum(portfolio['gbt_expected_loss']))

LR expected loss:  5596776.979852879
GBT expected loss:  5383982.809227714

It looks like the total expected loss for the XGBClassifier() model is quite a bit lower. When we talk about accuracy and precision, the goal is to generate models which have a low expected loss. Looking at a classification_report() helps as well.

3.3 Assessing gradient boosted trees

So we’ve now used XGBClassifier() models to predict probability of default. These models can also use the .predict() method for creating predictions that give the actual class for loan_status.

We should check the model’s initial performance by looking at the metrics from the classification_report(). Keep in mind that we have not set thresholds for these models yet.

# Predict the labels for loan status
gbt_preds= clf_gbt.predict(X_test)

# Check the values created by the predict method
print(gbt_preds)

# Print the classification report of the model
target_names = ['Non-Default', 'Default']
print(classification_report(y_test, gbt_preds, target_names=target_names))

Have a look at the precision and recall scores! Remember the low default recall values we were getting from the LogisticRegression()? This model already appears to have serious potential.

3.4 Column importance and default prediction

column_importance.JPG

When using multiple training sets with many different groups of columns, it’s important to keep an eye on which columns matter and which do not. It can be expensive or time-consuming to maintain a set of columns even though they might not have any impact on loan_status.

The X data for this exercise was created with the following code:

X = cr_loan_prep[['person_income','loan_int_rate',
              'loan_percent_income','loan_amnt',
              'person_home_ownership_MORTGAGE','loan_grade_F']]
              

Train an XGBClassifier() model on this data, and check the column importance to see how each one performs to predict loan_status.

# redefine our feature columns
X = cr_loan_prep[['person_income','loan_int_rate','loan_percent_income','loan_amnt','person_home_ownership_MORTGAGE','loan_grade_F']]              
# train, test, split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=123)
# Create and train the model on the training data
gbt = xgb.XGBClassifier().fit(X_train,np.ravel(y_train))

# Print the column importances from the model
print(gbt.get_booster().get_score(importance_type = 'weight'))

So, the importance for loan_grade_F is only 9 in this case. This could be because there are so few of the F-grade loans. While the F-grade loans don’t add much to predictions here, they might affect the importance of other training columns.
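
A quick check on just how rare the F-grade loans are (a small sketch; loan_grade_F is the one-hot column created earlier):

# Count and share of F-grade loans in the prepared data
print(cr_loan_prep['loan_grade_F'].sum())
print(cr_loan_prep['loan_grade_F'].mean())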

3.5 Visualizing column importance

column_importance_plot.JPG

When the model is trained on different sets of columns it changes the performance, but does the importance for the same column change depending on which group it’s in?

The data sets X2 and X3 have been created with the following code:

X2 = cr_loan_prep[['loan_int_rate','person_emp_length']]
X3 = cr_loan_prep[['person_income','loan_int_rate','loan_percent_income']]
# Redefine feature columns
X2 = cr_loan_prep[['loan_int_rate','person_emp_length']] 
X3 = cr_loan_prep[['person_income','loan_int_rate','loan_percent_income']]
# train, test, split
X2_train, X2_test, y_train, y_test = train_test_split(X2, y, test_size=.4, random_state=123)
X3_train, X3_test, y_train, y_test = train_test_split(X3, y, test_size=.4, random_state=123)

Understanding how different columns are used to arrive at a loan_status prediction is very important for model interpretability.

# Train a model on the X data with 2 columns
gbt2 = xgb.XGBClassifier().fit(X2_train,np.ravel(y_train))

# Plot the column importance for this model
xgb.plot_importance(gbt2, importance_type = 'weight')
plt.show()
# Train a model on the X data with 3 columns
gbt3 = xgb.XGBClassifier().fit(X3_train,np.ravel(y_train))

# Plot the column importance for this model
xgb.plot_importance(gbt3, importance_type = 'weight')
plt.show()

Take a closer look at the plots. Did you notice that the importance of loan_int_rate went from 1490 to 1013? Initially, this was the most important column, but person_income ended up taking the top spot here.

3.6 Column selection and model performance

train_columns.JPG

Creating the training set from different combinations of columns affects the model and the importance values of the columns. Does a different selection of columns also affect the F-1 scores (the combination of precision and recall) of the model? We can answer this question by training two different models on two different sets of columns and checking the performance.

Inaccurately predicting defaults as non-defaults results in unexpected losses, particularly where the predicted probability of default for those loans was very low. We can use the F-1 score for defaults to see how accurately the models predict the defaults.

# Predict the loan_status using each model
gbt_preds = gbt.predict(X_test)
gbt2_preds = gbt2.predict(X2_test)

# Print the classification report of the first model
target_names = ['Non-Default', 'Default']
print(classification_report(y_test,gbt_preds, target_names=target_names))

# Print the classification report of the second model
print(classification_report(y_test, gbt2_preds, target_names=target_names))

Originally, it looked like the selection of columns affected model accuracy the most, but now we see that the selection of columns also affects recall by quite a bit.

3.7 Cross validating credit models

We cannot create more loan data to help us develop our model, but we can use cross-validation to simulate how our model will perform on new loan data before it arrives.

cross_validation.JPG

k_fold.JPG

xgb_cross_valid_setup.JPG

xgb_cross_valid_setup_2.JPG
# Set the values for number of folds and stopping iterations
n_folds = 5
early_stopping = 10

# define params dictionary
params = {'objective': 'binary:logistic', 'seed': 123, 'eval_metric': 'auc'}

# Create the DTrain matrix for XGBoost
DTrain = xgb.DMatrix(X_train, label = y_train)

# Create the data frame of cross validations
cv_df = xgb.cv(params, DTrain, num_boost_round = 5, nfold=n_folds,
            early_stopping_rounds=early_stopping)

# Print the cross validations data frame
print(cv_df)

Looks good! Note how the AUC for both train-auc-mean and test-auc-mean improves at each iteration of cross-validation. The improvement suggests that our model is stable, but if we increase the number of iterations, will our scores keep improving until they eventually reach 1.0?

3.8 Limits to cross-validation testing

We can specify very large numbers for both nfold and num_boost_round if we want to perform an extreme amount of cross-validation. The data frame cv_results_big was created with the following code:

cv = xgb.cv(params, DTrain, num_boost_round = 600, nfold=10,
        shuffle = True)
        

Here, xgb.cv() performed 600 boosting iterations with 10-fold cross-validation. The shuffle parameter tells the function to shuffle the records before creating the folds.

Have a look at this data to see what the AUC scores are, and check whether they reach 1.0 under cross-validation. We should also plot the test AUC score to see the progression.

cv_results_big = xgb.cv(params, DTrain, num_boost_round = 600, nfold=10, shuffle = True)
# Print the first five rows of the CV results data frame
cv_results_big.head()
# Calculate the mean of the first five test AUC scores
np.mean(cv_results_big.head()['test-auc-mean']).round(2)

# Plot the test AUC scores for each iteration
plt.plot(cv_results_big['test-auc-mean'])
plt.title('Test AUC Score Over 600 Iterations')
plt.xlabel('Iteration Number')
plt.ylabel('Test AUC Score')
plt.show()

Notice that the test AUC score never quite reaches 1.0 and begins to decrease slightly after 100 iterations. This is because this much cross-validation can actually cause the model to overfit. So, there is a limit to how much cross-validation we should do.
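
To pin down where the test AUC peaks, a quick sketch on the cross-validation results:

# Find the boosting round with the highest mean test AUC
best_round = cv_results_big['test-auc-mean'].idxmax()
print(best_round, cv_results_big['test-auc-mean'].max())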

3.9 Cross-validation scoring

cross_val_score.JPG

Now, we should use cross-validation scoring with cross_val_score() to check the overall performance.

This exercise presents an excellent opportunity to test out the hyperparameters learning_rate and max_depth. Remember, hyperparameters are like settings which can help create optimum performance.

# use the full set of prepared features once again
X = cr_loan_prep.drop('loan_status', axis=1)

# train, test, split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=123)
# import
from sklearn.model_selection import cross_val_score

# Create a gradient boosted tree model using two hyperparameters
gbt = xgb.XGBClassifier(learning_rate = 0.1, max_depth = 7)

# Calculate the cross validation scores for 4 folds
cv_scores = cross_val_score(gbt, X_train, np.ravel(y_train), cv=4)

# Print the cross validation scores
print(cv_scores)

# Print the average accuracy and standard deviation of the scores
print("Average accuracy: %0.2f (+/- %0.2f)" % (cv_scores.mean(),
                                              cv_scores.std() * 2))

Our average cv_score for this model is getting higher! With only a couple of hyperparameters and cross-validation, we can get the average accuracy up to 93%. This is a great way to validate how robust the model is.

3.10 Undersampling training data

We just used cross-validation to check the robustness of our model, but let's also look at how the data itself affects that robustness. In our training dataset there are far more non-defaults than defaults. This is referred to as class imbalance, and it is a problem.

imbalance_causes.JPG
y_train['loan_status'].value_counts()

We have 13,798 non-defaults (78%) and 3,877 defaults (22%).

Our tree models use a loss function called log-loss, which the model tries to minimise. Take the example below, where we have one default and one non-default. Each prediction is equally far away from the true outcome, so the log-loss value is the same - however, the financial implications of an incorrect default prediction are much more severe than those of an incorrect non-default prediction.
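
A small worked example of that point, with assumed probabilities, just to show the symmetry of log-loss:

# Two predictions that are equally far from their true labels:
# a true default (1) predicted with P(default) = 0.4, and a true non-default (0) predicted with P(default) = 0.6
loss_default = -np.log(0.4)         # true label 1, log-loss = -log(p)
loss_nondefault = -np.log(1 - 0.6)  # true label 0, log-loss = -log(1 - p)

# The log-loss is identical, even though missing a real default is far more costly financially
print(loss_default, loss_nondefault)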

One strategy we can adopt to restore the balance of our training data is undersampling :

undersampling.JPG

undersample_train_test_split_1.JPG

undersample_train_test_split_2.JPG

It’s time to undersample the training set with a few lines of code from pandas. Once the undersampling is complete, we can check the value counts for loan_status to verify the results.

# Concat the training sets
X_y_train = pd.concat([X_train.reset_index(drop = True),
                       y_train.reset_index(drop = True)], axis = 1)

# Get counts of non default and defaults
count_nondefault, count_default = X_y_train['loan_status'].value_counts()
# Create data sets for defaults and non-defaults
nondefaults = X_y_train[X_y_train['loan_status'] == 0]
defaults =  X_y_train[ X_y_train['loan_status'] == 1]

# Undersample the non-defaults
nondefaults_under = nondefaults.sample(count_default)

# Concatenate the undersampled nondefaults with defaults
X_y_train_under = pd.concat([nondefaults_under.reset_index(drop = True),
                             defaults.reset_index(drop = True)], axis = 0)

# Print the value counts for loan status
print(X_y_train_under['loan_status'].value_counts())

Great. We now have a training set with an equal number of defaults and non-defaults. Let’s test out some machine learning models on this new undersampled data set and compare their performance to the models trained on the regular data set.

3.11 Undersampled tree performance

We’ve undersampled the training set and trained a model on the undersampled set.
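The classification reports below refer to gbt2_preds, the predictions from the model trained on the undersampled data. Here is a minimal sketch of how that second model might be fit, assuming the same hyperparameters as the original gbt model:

# Split the undersampled frame back into features and labels
X_train_under = X_y_train_under.drop('loan_status', axis=1)
y_train_under = X_y_train_under[['loan_status']]

# Train a second gradient boosted tree on the undersampled training data
gbt2 = xgb.XGBClassifier(learning_rate=0.1, max_depth=7)
gbt2.fit(X_train_under, np.ravel(y_train_under))

# Predict loan status on the original (untouched) test set
gbt2_preds = gbt2.predict(X_test)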

The model's performance affects not only the probability-of-default predictions on the test set, but also the scoring of new loan applications as they come in. We also now know that it is even more important for the recall of defaults to be high, because a default predicted as a non-default is more costly.

The next crucial step is to compare the new model’s performance to the original model.

# Check the classification reports
target_names = ['Non-Default', 'Default']

# print classification report for old model
print(classification_report(y_test, gbt_preds, target_names=target_names))
             precision    recall  f1-score   support

 Non-Default       0.93      0.99      0.96      9198
     Default       0.95      0.73      0.83      2586

    accuracy                           0.93     11784
   macro avg       0.94      0.86      0.89     11784
weighted avg       0.93      0.93      0.93     11784 
# print classification report for new model
print(classification_report(y_test, gbt2_preds, target_names=target_names))
             precision    recall  f1-score   support

 Non-Default       0.95      0.91      0.93      9198
     Default       0.72      0.84      0.77      2586

    accuracy                           0.89     11784
   macro avg       0.83      0.87      0.85     11784
weighted avg       0.90      0.89      0.89     11784
# Print the confusion matrix for old model
print(confusion_matrix(y_test,gbt_preds))

[[9105   93]
 [ 691 1895]]

# Print the confusion matrix for new model
print(confusion_matrix(y_test,gbt2_preds))

[[8338  860]
 [ 426 2160]]

# Print and compare the AUC scores of the old model
print(roc_auc_score(y_test, gbt_preds))

0.8613405315086655

# Print and compare the AUC scores of the new model
print(roc_auc_score(y_test,gbt2_preds))

0.870884117348218

Looks like this is classified as a success! Undersampling the training data results in more false positives, but the recall for defaults and the AUC score are both higher than the original model. This means overall it predicts defaults much more accurately.

3.12 Undersampling intuition

Now we’ve seen the effects of undersampling the training set to improve default prediction. We undersampled the training data set X_train, and it had a positive impact on the new model’s AUC score and recall for defaults. The training data had class imbalance which is normal for most credit loan data.

We did not undersample the test data X_test. Why not undersample the test set as well?

The test set represents the type of data that will be seen by the model in the real world, so changing it would test the model on unrealistic data.

4. Model evaluation and implementation

After developing and testing two powerful machine learning models, we use key performance metrics to compare them. Using advanced model selection techniques specifically for financial modeling, we will select one model. With that model, we will:

  • develop a business strategy
  • estimate portfolio value, and
  • minimize expected loss

4.1 Comparing model reports

We’ve used logistic regression models and gradient boosted trees. It’s time to compare these two to see which model will be used to make the final predictions.

One of the easiest first steps for comparing different models' ability to predict the probability of default is to look at their metrics from the classification_report(). With this, we can see many different scoring metrics side-by-side for each model. Because credit data is normally imbalanced, with few defaults, we will focus on the metrics for defaults for now.

# Print the logistic regression classification report
target_names = ['Non-Default', 'Default']
print(classification_report(y_test, preds_df_lr['loan_status'], target_names=target_names))
               precision    recall  f1-score   support

 Non-Default       0.86      0.92      0.89      9198
     Default       0.62      0.46      0.53      2586

    accuracy                           0.82     11784
   macro avg       0.74      0.69      0.71     11784
weighted avg       0.81      0.82      0.81     11784
# Print the gradient boosted tree classification report
print(classification_report(y_test, preds_df_gbt['loan_status'], target_names=target_names))
              precision    recall  f1-score   support

 Non-Default       0.93      0.99      0.96      9198
     Default       0.94      0.73      0.82      2586

    accuracy                           0.93     11784
   macro avg       0.93      0.86      0.89     11784
weighted avg       0.93      0.93      0.93     11784
# Print the macro average F-1 score for the logistic regression
print(precision_recall_fscore_support(y_test,preds_df_lr['loan_status'], average = 'macro')[2])

0.7108943782814463

# Print the macro average F-1 score for the gradient boosted tree
print(precision_recall_fscore_support(y_test,preds_df_gbt['loan_status'], average = 'macro')[2])

0.8909014142736051

There is a noticeable difference between these two models. The scores from the classification_report() are all higher for the gradient boosted tree. This means the tree model is better in all of these aspects. Let’s check the ROC curve.

4.2 Comparing with ROCs and AUCs

We should use ROC charts and AUC scores to compare the two models. Sometimes, visuals can really help you and potential business users understand the differences between the various models under consideration.

With the graph in mind, we will be better equipped to make a decision. The lift is how far the curve sits above the random-prediction line. The AUC is the area under the ROC curve. The model with more lift, and a higher AUC, is the one that's better at making accurate predictions.

# ROC chart components
fallout_lr, sensitivity_lr, thresholds_lr = roc_curve(y_test, clf_logistic_preds)
fallout_gbt, sensitivity_gbt, thresholds_gbt = roc_curve(y_test, clf_gbt_preds)

# ROC Chart with both
plt.plot(fallout_lr, sensitivity_lr, color = 'blue', label='%s' % 'Logistic Regression')
plt.plot(fallout_gbt, sensitivity_gbt, color = 'green', label='%s' % 'GBT')
plt.plot([0, 1], [0, 1], linestyle='--', label='%s' % 'Random Prediction')
plt.title("ROC Chart for LR and GBT on the Probability of Default")
plt.xlabel('Fall-out')
plt.ylabel('Sensitivity')
plt.legend()
plt.show()

# Print the logistic regression AUC with formatting
print("Logistic Regression AUC Score: %0.2f" % roc_auc_score(y_test, clf_logistic_preds))

# Print the gradient boosted tree AUC with formatting
print("Gradient Boosted Tree AUC Score: %0.2f" % roc_auc_score(y_test, clf_gbt_preds))

Logistic Regression AUC Score: 0.76
Gradient Boosted Tree AUC Score: 0.94

Look at the ROC curve for the gradient boosted tree. Not only is the lift much higher, the calculated AUC score is also quite a bit higher. It’s beginning to look like the gradient boosted tree is best. Let’s check the calibration to be sure.

4.3 Calibration curves

calibration.JPG

calibration_calc.JPG

We now know that the gradient boosted tree clf_gbt has the best overall performance. We still need to check the calibration of the two models to see how stable the default prediction performance is across probabilities. We can check this by charting each model's calibration, using the calibration_curve() function to generate the points.

Calibration curves can require many lines of code in python, so we will go through each step slowly to add the different components.
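In practice, the fraction-of-positives and mean-predicted-value arrays used below come from scikit-learn's calibration_curve(). A minimal sketch, assuming clf_logistic_preds and clf_gbt_preds hold each model's predicted probabilities of default for the test set (bins with no samples are dropped, which is why the two models can end up with a different number of points):

from sklearn.calibration import calibration_curve

# Bin the predicted probabilities and compare the mean prediction in each bin
# with the observed fraction of defaults in that bin
frac_of_pos_lr, mean_pred_val_lr = calibration_curve(y_test, clf_logistic_preds, n_bins=20)
frac_of_pos_gbt, mean_pred_val_gbt = calibration_curve(y_test, clf_gbt_preds, n_bins=20)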

# Set calibration curve outputs
frac_of_pos_lr = ([0.07886231, 0.06610942, 0.10835913, 0.13505074, 0.16063348, 0.18333333, 0.21268657, 0.24099099, 0.48036649, 0.72677596,
       0.73354232, 0.70547945, 0.68 , 0.73913043, 0.55555556, 0.4 ])

mean_pred_val_lr = ([0.02111464, 0.07548788, 0.12582662, 0.17502903, 0.22449499, 0.27491676, 0.32488847, 0.37486698, 0.42302912, 0.47397249,
       0.52304288, 0.57259508, 0.62200793, 0.67156702, 0.71909209, 0.77024859])


frac_of_pos_gbt = ([0.01916168, 0.06385752, 0.12795793, 0.17460317, 0.21806854, 0.32620321, 0.32653061, 0.33333333, 0.40677966, 0.43181818,
       0.6, 0.42105263, 0.31578947, 0.6875, 0.78571429, 0.83333333, 0.90697674, 0.95238095, 0.98850575, 1.])

mean_pred_val_gbt = ([0.01937249, 0.07211534, 0.12178284, 0.17298488, 0.22318428, 0.2716055, 0.32285183, 0.369344, 0.42164062, 0.47158214,
       0.52230485, 0.57041398, 0.62149714, 0.67234764, 0.72826275, 0.77567046, 0.82827961, 0.87636708, 0.92830987, 0.98579916])

calibration_plot.JPG
# Create the calibration curve plot with the guideline
plt.plot([0, 1], [0, 1], 'k:', label= "Perfectly calibrated")    
plt.ylabel('Fraction of positives')
plt.xlabel('Average Predicted Probability')
plt.legend()
plt.title('Calibration Curve')
plt.show()
# Add the calibration curve for the logistic regression to the plot
plt.plot([0, 1], [0, 1], 'k:', label='Perfectly calibrated')    
plt.plot(mean_pred_val_lr, frac_of_pos_lr,
         's-', label='%s' % 'Logistic Regression')
plt.ylabel('Fraction of positives')
plt.xlabel('Average Predicted Probability')
plt.legend()
plt.title('Calibration Curve')
plt.show()
# Add the calibration curve for the gradient boosted tree
plt.plot([0, 1], [0, 1], 'k:', label='Perfectly calibrated')    
plt.plot(mean_pred_val_lr, frac_of_pos_lr,
         's-', label='%s' % 'Logistic Regression')
plt.plot(mean_pred_val_gbt, frac_of_pos_gbt,
         's-', label='%s' % 'Gradient Boosted tree')
plt.ylabel('Fraction of positives')
plt.xlabel('Average Predicted Probability')
plt.legend()
plt.title('Calibration Curve')
plt.show()

Take a good look at this. Notice that for the logistic regression, the calibration of the probabilities starts off well but then becomes more erratic as the average predicted probability approaches 0.4. Something similar happens to the gradient boosted tree around 0.5, but that model eventually stabilizes. We will focus only on the gbt model from now on.

4.4 Acceptance rates

Setting an acceptance rate and calculating the corresponding threshold lets us control the percentage of new loans we want to accept.

acceptance_rate.JPG

In the above example, the distribution of our loans is represented in terms of their probability of default. 85% of our loans sit to the left of the dashed line and 15% sit to the right. If our policy is to reject 15% of loans, then the dashed line represents the acceptance threshold. Reading off the graph, we should therefore reject any new loan application with a predicted probability of default of around 0.78 or above.

In our example, the exact acceptance threshold can be calculated using NumPy's quantile() function.

import numpy as np
threshold = np.quantile(prob_default, 0.85)

We would then reassign our loan_status values as we did earlier, but this time using the calculated threshold (about 0.78 in the example above) rather than an arbitrary cut-off:

preds_df['loan_status'] = preds_df['prob_default'].apply(lambda x: 1 if x > 0.78 else 0)        

Let’s see how this works in more detail by applying to our loan data. For this exercise, assume the test data is a fresh batch of new loans. We will need to use the quantile() function from numpy to calculate the threshold.

The threshold should be used to assign new loan_status values. Does the number of defaults and non-defaults in the data change?

# Check the statistics of the probabilities of default
print(test_pred_df['prob_default'].describe())

default_stats.JPG
# Calculate the threshold for an 85% acceptance rate
threshold_85 = np.quantile(test_pred_df['prob_default'], 0.85)
threshold_85

0.8039963573217376

# Apply acceptance rate threshold
test_pred_df['pred_loan_status'] = test_pred_df['prob_default'].apply(lambda x: 1 if x > threshold_85 else 0)

# Print the counts of loan status after the threshold
print(test_pred_df['pred_loan_status'].value_counts())

loan_status_counts.JPG

In the results of .describe() we can see that it's not until the 75th percentile that the predicted probabilities of default reach double digits (as percentages). That's because the majority of our test set is non-default loans. Next, let's look at how the acceptance rate and threshold split up the data.

4.5 Visualizing quantiles of acceptance

We know how quantile() works to compute a threshold, and we've seen an example of what it does to split the loans into accepted and rejected. What does this threshold look like for the test set, and how can we visualize it? To check this, we can create a histogram of the probabilities and add a reference line for the threshold. With this, we can visually show where the threshold sits in the distribution.

# Plot the predicted probabilities of default
plt.hist(clf_gbt_preds, color = 'blue', bins = 40)

# Calculate the threshold with quantile
threshold = np.quantile(clf_gbt_preds, 0.85)

# Add a reference line to the plot for the threshold
plt.axvline(x = threshold, color = 'red')
plt.show()

Here, we can clearly see where the threshold is on the range of predicted probabilities - 0.804 as calculated in section 4.4 - and indicated by the red line. Not only can we see how many loans will be accepted (left side), but also how many loans will be rejected (right side).

4.6 Bad rates

bad_rate.JPG

With acceptance rate in mind, we can now analyze the bad rate within the accepted loans. This way we will be able to see the percentage of defaults that have been accepted. Think about the impact of the acceptance rate and bad rate. We set an acceptance rate to have fewer defaults in the portfolio because defaults are more costly. Will the bad rate be less than the percentage of defaults in the test data?

# Print the top 5 rows of the new data frame
print(test_pred_df.head())
    true_loan_status  prob_default  pred_loan_status
0                 1         0.982                 1
1                 1         0.975                 1
2                 0         0.003                 0
3                 0         0.005                 0
4                 1         0.120                 0
# Create a subset of only accepted loans
accepted_loans = test_pred_df[test_pred_df['pred_loan_status'] == 0]

# Calculate the bad rate
print(np.sum(accepted_loans['true_loan_status']) / accepted_loans['true_loan_status'].count())

0.08256789137380191

This bad rate doesn’t look half bad! The bad rate with the threshold set by the 85% quantile() is just over 8%. This means that of all the loans we’ve decided to accept from the test set, only 8% were actual defaults! If we accepted all loans, the percentage of defaults would be around 22%.
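That 22% figure is simply the overall default rate in the test set, which we can confirm directly:

# Default rate if we accepted every loan in the test set
print(test_pred_df['true_loan_status'].mean().round(2))   # roughly 0.22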

4.7 Acceptance rate impact

Now, look at the loan_amnt of each loan to understand the impact on the portfolio for different acceptance rates. We can use cross tables with calculated values, like the average loan amount, for the new set of loans X_test. For this, we multiply the count in each cell by the average loan_amnt value.

When printing these values, try formatting them as currency so that the numbers look more realistic. After all, credit risk is all about money. This is accomplished with the following code:

pd.options.display.float_format = '${:,.2f}'.format
# Print the statistics of the loan amount column
print(test_pred_df['loan_amnt'].describe())

loan_amount_stats.JPG
# Store the average loan amount
avg_loan = np.mean(test_pred_df['loan_amnt'])

# Set the formatting for currency, and print the cross tab
pd.options.display.float_format = '${:,.2f}'.format

# print the cross table
print(pd.crosstab(test_pred_df['true_loan_status'],
                 test_pred_df['pred_loan_status_15']).apply(lambda x: x * avg_loan, axis = 0))

pred_loan_status.JPG

With this, we can see that our bad rate of about 8% represents an estimated loan value of about 7.9 million dollars. This may seem like a lot at first, but compare it to the total value of non-default loans! With this, we are ready to start talking about our acceptance strategy going forward.

4.8 Making the strategy table

Before we implement a strategy, we should first create a strategy table containing all the possible acceptance rates we wish to look at along with their associated bad rates and threshold values. This way, we can begin to see each part of our strategy and how it affects our portfolio.

Automatically calculating all of these values only requires a for loop, though it takes quite a few lines of Python code.

strategy_table_setup.JPG

strategy_table_calcs.JPG
# create empty lists for thresholds and bad rates - to be appended
thresholds = []
bad_rates = []
# let's set the accept rates that we would like to compare calculations for
accept_rates = [1.0, 0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45, 0.4, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1, 0.05]
# Populate the arrays for the strategy table with a for loop
for rate in accept_rates:
    
    # Calculate the threshold for the acceptance rate
    thresh = np.quantile(preds_df_gbt['prob_default'], rate).round(3)
    
    # Add the threshold value to the list of thresholds
    thresholds.append(thresh)
    
    # Reassign the loan_status value using the threshold
    test_pred_df['pred_loan_status'] = test_pred_df['prob_default'].apply(lambda x: 1 if x > thresh else 0)
    
    # Create a set of accepted loans using this acceptance rate
    accepted_loans = test_pred_df[test_pred_df['pred_loan_status'] == 0]
    
    # Calculate and append the bad rate among the accepted loans
    bad_rates.append((np.sum(accepted_loans['true_loan_status']) / len(accepted_loans['true_loan_status'])).round(3))
# Set the thresholds and bad rates (hard-coded with the values produced by the loop above)
thresholds = [1.0, 0.992, 0.976, 0.804, 0.254, 0.178, 0.138, 0.111, 0.093, 0.078, 0.066, 0.055, 0.045, 0.037, 0.03, 0.022, 0.015, 0.008, 0.004, 0.002]
bad_rates = [0.219, 0.179, 0.132, 0.083, 0.061, 0.052, 0.043, 0.036, 0.03, 0.027, 0.023, 0.02, 0.017, 0.014, 0.01, 0.008, 0.005, 0.001, 0.0, 0.0]
# Create a data frame of the strategy table
strat_df = pd.DataFrame(zip(accept_rates, thresholds, bad_rates),
                        columns = ['Acceptance Rate','Threshold','Bad Rate'])

# Print the entire table
strat_df
Acceptance Rate Threshold Bad Rate
0 1.00 1.000 0.219
1 0.95 0.992 0.179
2 0.90 0.976 0.132
3 0.85 0.804 0.083
4 0.80 0.254 0.061
5 0.75 0.178 0.052
6 0.70 0.138 0.043
7 0.65 0.111 0.036
8 0.60 0.093 0.030
9 0.55 0.078 0.027
10 0.50 0.066 0.023
11 0.45 0.055 0.020
12 0.40 0.045 0.017
13 0.35 0.037 0.014
14 0.30 0.030 0.010
15 0.25 0.022 0.008
16 0.20 0.015 0.005
17 0.15 0.008 0.001
18 0.10 0.004 0.000
19 0.05 0.002 0.000

That for loop was a lot of code, but look at the strategy table we have now. It uses our specific predictions on the credit data, and lets us see the acceptance rates, their thresholds, and their bad rates all at once. One of these acceptance rates will turn out to have the highest estimated portfolio value.

4.9 Visualizing the strategy

Now we have the extended strategy table strat_df. The table is not so big that it’s difficult to analyze, but visuals can help us see the overview all at once.

We should check the distribution of each column with a box plot. If the distribution of Acceptance Rate looks the same as the Bad Rate column, that could be a problem. That means that the model’s calibration is likely much worse than we thought.

We can also visualize the strategy curve with a line plot. The Acceptance Rate would be the independent variable with the Bad Rate as the dependent variable.

# Visualize the distributions in the strategy table with a boxplot
strat_df.boxplot()
plt.show()

The boxplots above show us the distribution for each column.

# Plot the strategy curve
plt.plot(strat_df['Acceptance Rate'], strat_df['Bad Rate'])
plt.xlabel('Acceptance Rate')
plt.ylabel('Bad Rate')
plt.title('Acceptance and Bad Rates')
plt.show()

The bad rates are very low up until the acceptance rate 0.6 where they suddenly increase. This suggests that many of the accepted defaults may have a prob_default value between 0.6 and 0.8.

4.10 Estimated value profiling

The strategy table, strat_df, can be used to maximize the estimated portfolio value and minimize expected loss. Extending this table and creating some plots can be very helpful to this end.

The strat_df data frame is loaded and has been enhanced already with the following columns:

strategy_table_prepop.JPG
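One plausible way to derive those extra columns - an illustrative sketch rather than the exact code used to build the table, reusing the avg_loan value from section 4.7 - is to credit each accepted non-default with roughly the average loan value and charge each accepted default with the same amount:

# Number of test-set loans accepted at each acceptance rate
strat_df['Num Accepted Loans'] = (strat_df['Acceptance Rate'] * len(test_pred_df)).astype(int)

# Estimated value: value of accepted non-defaults minus losses on accepted defaults
strat_df['Estimated Value'] = (strat_df['Num Accepted Loans'] * (1 - strat_df['Bad Rate']) * avg_loan) \
                            - (strat_df['Num Accepted Loans'] * strat_df['Bad Rate'] * avg_loan)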
# Create a line plot of estimated value
plt.plot(strat_df['Acceptance Rate'],strat_df['Estimated Value'])
plt.title('Estimated Value by Acceptance Rate')
plt.xlabel('Acceptance Rate')
plt.ylabel('Estimated Value')
plt.show()

# Print the row with the max estimated value
print(strat_df.loc[strat_df['Estimated Value'] == np.max(strat_df['Estimated Value'])])

max_est_value.JPG

Interesting! With our credit data and our estimated average loan value, we clearly see that the acceptance rate of 0.85 has the highest potential estimated value. Normally, the allowable bad rate is set first, but we can use analyses like this to explore other options.

4.11 Total expected loss

total_expected_loss.JPG

It's time to estimate the total expected loss given all our decisions. The data frame test_pred_df has the probability of default for each loan and that loan's value. We can use these two values to calculate the expected loss for each loan, then sum those values to get the total expected loss.

For this exercise, we will assume that the exposure is the full value of the loan, and the loss given default is 100%. This means that a default on any loan is a loss of the entire loan amount.
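Written out, the calculation the code below performs is the standard expected loss decomposition, where the exposure at default (EAD) is the loan amount and LGD is the loss given default:

$$\text{Total expected loss} = \sum_{i} PD_i \times LGD_i \times EAD_i$$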

# Print the first five rows of the data frame
print(test_pred_df.head())

test_pred_df.JPG
# Calculate the bank's expected loss and assign it to a new column
test_pred_df['expected_loss'] = test_pred_df['prob_default'] * test_pred_df['loss_given_default'] * test_pred_df['loan_amnt']

# Calculate the total expected loss to two decimal places
tot_exp_loss = round(np.sum(test_pred_df['expected_loss']), 2)

# Print the total expected loss
print('Total expected loss: ', '${:,.2f}'.format(tot_exp_loss))

Total expected loss: $27,084,153.38

This is the total expected loss for the entire portfolio using the gradient boosted tree. 27 million US dollars may seem like a lot, but the total expected loss would have been over 28 million US dollars with the logistic regression. Some losses are unavoidable, but our work here might have saved the company a million dollars!

Key takeaways

Our first step was to prepare credit data for machine learning models:

  • important to understand the data
  • improving the data allows for high performing simple models

We then developed and scored Logistic Regression and Gradient Boosted Tree models, and now understand both:

  • these models are simple and explainable
  • their performance on probabilities is acceptable

We then analyzed the performance of models by changing the data, and now understand the financial impact of results.

We implemented the model with an understanding of strategy. The models and framework covered:

  • discrete-time hazard model (point in time): the probability of default is a point-in-time event
  • structural model framework: the model explains the default event based on other factors

Many financial sectors prefer model interpretability:

  • complex or black-box models are a risk because the business cannot fully explain their decisions. This is important, particularly when considering loan applications - the customer has a right to know the basis on which their application was rejected

  • deep neural networks are often too complex

Closing thoughts and acknowledgements

Focus on the data:

  • gather as much data as possible
  • use many different techniques to prepare and enhance data
  • learn about the business
  • increase value through data

Model complexity can be a double-edged sword:

  • really complex models may perform well, but are seen as a black box
  • in many cases, business users will not accept a model they cannot understand
  • complex models can be very large and difficult to put into production

Thanks to Michael Crabtree for creating this course on DataCamp. It explains how to interpret the coefficients and intercepts of a Logistic Regression model particularly well in my view.